-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to preserve the original ordering of columns #75
Conversation
Summary: - It's nice to have the option of keeping the columns in the same order as the input file for json and dict input formats. - On a related note, I'm not sure what BigQuery's `autodetect` functionality uses to determine order, but as of today it is not alphabetical.
@bxparks -- thanks for all your work on this tool, it's quickly becoming an integral part of our data pipeline. I probably should have opened an issue to discuss this change before opening a pull request, but figured it was easy enough to make the code change and we can discuss from here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor formatting nits below.
I guess the bigger question is, does the sorting order matter for JSON input data? Because the order of keys in JSON file is undefined:
https://datatracker.ietf.org/doc/html/rfc7159#section-1
" An object is an unordered collection of zero or more name/value
pairs, where a name is a string and a value is a string, number,
boolean, null, object, or array."
So the BQ schema, as represented by a JSON file, should be considered identical no matter how the keys are ordered.
I guess I had sorted the keys when printing out the schema, because bq show --schema
used to sort the keys. But if you are saying that it no longer sorts the keys, then what does it do? If the order is random, or a non-deterministic outcome of how the autodetect feature worked, then I am not sure I can replicate the exact order of those keys in this script. In other words, I'm not convinced that --preserve_input_sort_order
will produce a JSON schema file that is identical to the one produced by bg show --schema
.
@@ -1042,6 +1043,12 @@ def main(): | |||
' This can be fetched with:' | |||
' `bq show --schema <project_id>:<dataset>:<table_name>', | |||
default=None) | |||
parser.add_argument( | |||
'--keep-input-sort-order', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use --preserve_input_sort_order
using underscore for consistency with other flags.
@@ -113,7 +114,7 @@ def __init__( | |||
# If CSV, preserve the original ordering because 'bq load` matches the | |||
# CSV column with the respective schema entry using the position of the | |||
# column in the schema. | |||
self.sorted_schema = (input_format in {'json', 'dict'}) | |||
self.sorted_schema = (input_format in {'json', 'dict'}) and not keep_input_sort_order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you run $ make flake8
at the root of the project, you'll see that it will complain about being longer than 80 columns. Break this into smaller lines using something like:
self.sorted_schema = (
(input_format in {'json', 'dict'})
and not keep_input_sort_order
)
@@ -1042,6 +1043,12 @@ def main(): | |||
' This can be fetched with:' | |||
' `bq show --schema <project_id>:<dataset>:<table_name>', | |||
default=None) | |||
parser.add_argument( | |||
'--keep-input-sort-order', | |||
help='Preserve the original ordering of columns from input instead of sorting alphabetically.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
break long line < 80 columns
@@ -86,6 +86,7 @@ def __init__( | |||
debugging_map=False, | |||
sanitize_names=False, | |||
ignore_invalid_lines=False, | |||
keep_input_sort_order=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use preserve_input_sort_order
My guess is that at some point, Google changed their internal schema generator tool from Python to Go, and Go's JSON serializer does not preserve ordering of the keys, in keeping with the JSON specification: |
I think the critical piece to all of this is that a BigQuery schema is a JSON array I think the behavior boils down to these questions: When you provide a json schema to When you don't provide a json schema, what does Is |
I agree the BQ schema JSON is ordered because it uses an array. And I think this flag is worth adding. But I have a concern that we are making some assumptions that may be transitory implementation details of the
If subsequent lines in the input data had a different key ordering, like this:
does Do you know any BigQuery documentation that explicitly defines the ordering of the keys in the BQ schema in relation to the ordering of the keys in the input JSON data? |
It also seems tome that the input ordering has further subtleties, because each JSON data line does not need to contain all the keys. So, in the following:
I think your
|
No, I couldn't find anything that dealt with the ordering of columns. https://cloud.google.com/bigquery/docs/schema-detect I don't think we need to be overly concerned with the behavior of However...
seems like a big deal to us, as |
(Correction, C++ |
That's actually a good point.
I'll have to add an entry in the README.md for this flag, so I think we put that warning there.
Python 3.6 should be a decreasing percentage of users of this library, so I'm ok with not guaranteeing that this flag will work in 3.6. Apparently the implementation detail of I want to write some unit tests to verify edge cases, but I think this PR is worth adding. |
Thanks for working through this with me. I'm happy to update the README and add tests, but if you'd prefer to do that yourself then fine by me 😄 |
It'll take me longer to explain the administrative work, than just doing it myself. Thanks for offering though. |
I'll let this simmer for a few days, before pushing a new release. |
Pushed v1.5 to PyPI. Thanks for the work on this! |
Summary:
as the input file for json and dict input formats.
autodetect
functionalityuses to determine order, but as of today it is not alphabetical.